Authors: Mauro Venticinque, Angelo Schillaci, Daniele Tambone
GitHub project: Bank-Marketing
Date: 2025-03-31
In this project, we analyze data from a Portuguese banking institution’s direct marketing campaigns to identify key factors influencing customer subscription to term deposits. The dataset includes client demographics, previous campaign interactions, and economic indicators. Our goal is to develop insights that will enhance the effectiveness of future marketing strategies. By applying supervised learning techniques, we aim to predict customer responses and optimize outreach efforts for better engagement and conversion rates.
The report will begin with an Exploratory Data Analysis, examining the variables and their relationship with the target attribute (subscribed) to identify the most influential factors.
- age (Integer): age of the customer
- job (Categorical): occupation
- marital (Categorical): marital status
- education (Categorical): education level
- default (Binary): has credit in default?
- housing (Binary): has housing loan?
- loan (Binary): has personal loan?
- contact (Categorical): contact communication type
- month (Categorical): last contact month of year
- day_of_week (Categorical): last contact day of the week
- duration (Integer): last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model
- campaign (Integer): number of contacts performed during this campaign and for this client (numeric, includes last contact)
- pdays (Integer): number of days that passed by after the client was last contacted from a previous campaign (numeric; the source description gives -1 for clients not previously contacted, but in this dataset the value 999 is used instead, as the summary below shows)
- previous (Integer): number of contacts performed before this campaign and for this client
- poutcome (Categorical): outcome of the previous marketing campaign (categorical: ‘failure’, ‘nonexistent’, ‘success’)
- subscribed (Binary): has the client subscribed to a term deposit?

Source: UCI Machine Learning Repository
| Name | train |
| Number of rows | 32950 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| character | 11 |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| job | 0 | 1 | 6 | 13 | 0 | 12 | 0 |
| marital | 0 | 1 | 6 | 8 | 0 | 4 | 0 |
| education | 0 | 1 | 7 | 19 | 0 | 8 | 0 |
| default | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
| housing | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
| loan | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
| contact | 0 | 1 | 8 | 9 | 0 | 2 | 0 |
| month | 0 | 1 | 3 | 3 | 0 | 10 | 0 |
| day_of_week | 0 | 1 | 3 | 3 | 0 | 5 | 0 |
| poutcome | 0 | 1 | 7 | 11 | 0 | 3 | 0 |
| subscribed | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 40.04 | 10.45 | 17.00 | 32.00 | 38.00 | 47.00 | 98.00 | ▅▇▃▁▁ |
| duration | 0 | 1 | 258.66 | 260.83 | 0.00 | 102.00 | 180.00 | 318.00 | 4918.00 | ▇▁▁▁▁ |
| campaign | 0 | 1 | 2.57 | 2.77 | 1.00 | 1.00 | 2.00 | 3.00 | 43.00 | ▇▁▁▁▁ |
| pdays | 0 | 1 | 961.90 | 188.33 | 0.00 | 999.00 | 999.00 | 999.00 | 999.00 | ▁▁▁▁▇ |
| previous | 0 | 1 | 0.17 | 0.49 | 0.00 | 0.00 | 0.00 | 0.00 | 7.00 | ▇▁▁▁▁ |
| emp.var.rate | 0 | 1 | 0.08 | 1.57 | -3.40 | -1.80 | 1.10 | 1.40 | 1.40 | ▁▃▁▁▇ |
| cons.price.idx | 0 | 1 | 93.57 | 0.58 | 92.20 | 93.08 | 93.75 | 93.99 | 94.77 | ▁▆▃▇▂ |
| cons.conf.idx | 0 | 1 | -40.49 | 4.63 | -50.80 | -42.70 | -41.80 | -36.40 | -26.90 | ▅▇▁▇▁ |
| euribor3m | 0 | 1 | 3.62 | 1.74 | 0.63 | 1.34 | 4.86 | 4.96 | 5.04 | ▅▁▁▁▇ |
| nr.employed | 0 | 1 | 5167.01 | 72.31 | 4963.60 | 5099.10 | 5191.00 | 5228.10 | 5228.10 | ▁▁▃▁▇ |
The dataset includes 21 variables and 32,950 rows, with no
missing values.
Categorical variables like job and
education have many levels (12 and 8 respectively), while
default, loan, and
housing have only 3 unique values each.
Among numeric variables, age is roughly bell-shaped
with a mild right skew (mean ≈ 40, sd ≈ 10), while
duration is highly right-skewed, with values up to 4918 seconds.
pdays is dominated by the value 999, which covers at least
three quarters of the observations and behaves as a placeholder
rather than a genuine day count.
Some variables (e.g., campaign,
previous) have a low median but long tails, indicating
that most observations are clustered at low values.
Macroeconomic variables such as emp.var.rate,
euribor3m, and nr.employed take values in
narrow ranges, and their medians and upper quartiles sit close to the
maximum, suggesting that a large share of contacts took place under
similar economic conditions.
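In the summary table, the 25th, 50th, 75th and 100th percentiles of pdays all equal 999, which is consistent with 999 acting as a placeholder for clients who were never contacted before. A minimal sketch of one way to separate that placeholder from real day counts follows; the constant name, the helper `recode_pdays`, and the sample values are illustrative assumptions, not part of the original analysis.

```python
# Assumption: 999 marks "not previously contacted" in this dataset.
NOT_CONTACTED = 999


def recode_pdays(pdays):
    """Return (was_contacted, days), with days set to None when never contacted."""
    was_contacted = pdays != NOT_CONTACTED
    days = pdays if was_contacted else None
    return was_contacted, days


# Made-up values for illustration only, not taken from the dataset.
sample = [999, 3, 999, 6, 0]
recoded = [recode_pdays(value) for value in sample]
print(recoded)
```

Keeping the flag and the day count apart avoids treating 999 as a very long waiting time in summaries or models.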
First, we see that the dataset is unbalanced: the majority of clients have not subscribed.
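The degree of imbalance can be quantified as the share of each class. The sketch below is a minimal illustration with made-up labels; the function name `class_shares` is hypothetical, and the proportions shown are not the ones in this dataset.

```python
from collections import Counter


def class_shares(labels):
    """Return the proportion of each distinct label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up labels for illustration only, not taken from the dataset.
labels = ["no"] * 9 + ["yes"] * 1
shares = class_shares(labels)
majority_baseline = max(shares.values())  # accuracy of always predicting the majority class
print(shares, majority_baseline)
```

The majority share is a useful reference point: a classifier that always answers with the majority class already reaches that accuracy, so accuracy alone says little on unbalanced data.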
Correlation Matrix
The correlation matrix
reveals clear patterns among the numerical variables. Notably,
euribor3m, nr.employed, and
emp.var.rate are strongly positively correlated with
each other, which suggests that these variables capture similar information
about the economic environment. This should be taken into account in
predictive modeling, as using them together could lead to
multicollinearity. In contrast, variables like
campaign, pdays, and
previous show very weak correlations with most other
features, indicating they may contribute more independently to the
model.
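As a rough illustration of how strongly correlated pairs could be flagged before modeling, the sketch below computes Pearson correlations by hand and lists the pairs above a threshold. The 0.9 threshold, the function names, and the toy columns are assumptions made for the example; they are not values or code from the original analysis.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """List column pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Made-up toy columns for illustration only, not taken from the dataset.
toy = {
    "euribor3m": [1.0, 2.0, 3.0, 4.0, 5.0],
    "nr.employed": [5000.0, 5050.0, 5100.0, 5150.0, 5200.0],
    "campaign": [2.0, 1.0, 4.0, 1.0, 3.0],
}
flagged = correlated_pairs(toy)
print(flagged)
```

For each flagged pair, one option is to keep a single representative variable; another is to use a model or regularization scheme that tolerates correlated inputs.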
Scatterplot Matrix
The scatterplot matrix
confirms the distribution shape and linearity of
relationships among the numeric variables. Several variables, such as
duration and pdays, show
highly skewed distributions, which could influence
model performance and may benefit from transformations (e.g., log or
binning). While some variables exhibit linear trends (e.g., euribor3m vs
nr.employed), many scatterplots show dispersed or nonlinear patterns.
This suggests that simple linear models may not fully capture the
complexity in the data.
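To make the log transformation concrete, the sketch below compares the skewness of a right-skewed sample before and after applying log(1 + x). The skewness formula is the standard moment-based one; the sample durations are made up for illustration and the function name is hypothetical.

```python
from math import log1p


def skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up durations in seconds for illustration only, not taken from the dataset.
durations = [10, 30, 60, 90, 120, 180, 240, 400, 900, 4000]
logged = [log1p(d) for d in durations]  # log1p stays defined when a duration is 0

print(skewness(durations), skewness(logged))
```

On this toy sample the transformed values are much closer to symmetric. Whether the same holds for the real variables would need to be checked on the data.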
As we can see, the age distribution differs across job categories, especially for students, who are younger than the other categories, and for retired clients, who are older and span a wider range of ages. The few young retirees may be people who retired early, for example because of disability.
Turning to education level, more educated clients tend to be younger than less educated ones. This may be because more educated people spend more years studying and fewer working.
Distribution of Age
The age distribution is
right-skewed, with a peak around 30–40 years old. The proportion of
people that have subscribed is higher among those over 60. This may be
due to greater financial stability in older age groups.
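The comparison of subscription proportions across groups, used here for age and in the following sections for other variables, can be sketched as a grouped rate. The age bands, the helper names, and the records below are assumptions chosen for illustration; they do not reproduce the figures of this report.

```python
from collections import defaultdict


def age_band(age):
    """Assign an age to one of three illustrative bands."""
    if age < 30:
        return "<30"
    if age <= 60:
        return "30-60"
    return ">60"


def subscription_rate_by(records, key):
    """Share of records with subscribed == 'yes' within each group given by key."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        group = key(record)
        totals[group] += 1
        if record["subscribed"] == "yes":
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


# Made-up records for illustration only, not taken from the dataset.
records = [
    {"age": 25, "subscribed": "no"},
    {"age": 28, "subscribed": "yes"},
    {"age": 35, "subscribed": "no"},
    {"age": 42, "subscribed": "no"},
    {"age": 50, "subscribed": "no"},
    {"age": 55, "subscribed": "yes"},
    {"age": 65, "subscribed": "yes"},
    {"age": 70, "subscribed": "yes"},
    {"age": 80, "subscribed": "no"},
]
rates = subscription_rate_by(records, lambda r: age_band(r["age"]))
print(rates)
```

The same helper can group by any other attribute by changing the key function. Rates computed on small groups should be read with caution, since a few records can move them a lot.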
Distribution of Job
The distribution of
occupation is not uniform: the largest group is admin.
The proportion of subscribers among admin. clients is among the highest
across occupations, possibly because clients in administrative roles
have a higher income and are therefore more likely to subscribe.
Students and retired clients have a higher proportion of
subscriptions, which is consistent with the previous plots, where
older clients and those with a higher education level were more likely
to subscribe.
Distribution of Education
Regarding education level,
the distribution is not uniform:
the largest group holds a university degree. The
proportion of subscribers among clients with a university degree
is among the highest across education levels, possibly
because clients with a university degree have
a higher income and are more likely to subscribe.
Distribution of Contacts
Regarding the previous
campaign, most clients were not previously contacted, but the success
rate is visibly higher among those who were previously contacted more
than once or had a successful prior outcome. This suggests that prior
engagement is positively associated with subscription, although these
clients make up only a small part of the sample.
Distribution of Days of Week
The distribution of
the last contact day of the week is roughly uniform, with Thursday
being the most common day. The proportion of subscribers
is highest when the last contact falls
in the middle of the week.
Distribution of Months
In contrast, the distribution
of the last contact month is not uniform: the largest share
of contacts took place in May. The proportion of
subscribers is highest when the last contact month
is March, December, September, or October. This is probably
because people are more likely to subscribe when they have more
money available, and less likely during the summer.
Distribution of Duration
The duration of the
last contact is right-skewed, with a peak around 0-100 seconds. The
proportion of subscribers is higher among clients whose last
call lasted longer. This is probably because
clients who stay on the call longer are more
interested in subscribing.
In conclusion, the analysis reveals several key insights about the factors influencing the likelihood of subscription in this dataset. First, the data is unbalanced, with the majority of individuals not subscribing to the service. Age plays a significant role, with older individuals, particularly those over 60, being more likely to subscribe, potentially due to greater financial stability. Occupation and education level also appear to influence subscription, with people in administrative roles and those with higher education showing higher subscription rates, likely due to higher income and greater financial stability.
Previous engagement with the campaign positively correlates with subscription, especially for individuals contacted multiple times or who had a successful outcome in prior campaigns. Additionally, the timing of the contact seems to affect subscription rates, with the highest rates observed in March, December, September, and October, and a higher likelihood of subscription when the contact duration is longer.
Economic factors, such as the Consumer Price Index (CPI), employment variation rate, consumer confidence index, and Euribor rate, all show significant associations with subscription rates. A lower CPI, negative employment variation rate, and a higher consumer confidence index tend to be linked to higher subscription rates, reflecting the impact of financial conditions on consumer behavior.
Overall, the analysis suggests that financial stability, previous engagement, and certain economic factors are key drivers of subscription, while factors like age, occupation, and education level also influence the likelihood of subscribing.
Social and economic context attributes:
- emp.var.rate (Integer): employment variation rate - quarterly indicator
- cons.price.idx (Integer): consumer price index - monthly indicator
- cons.conf.idx (Integer): consumer confidence index - monthly indicator
- euribor3m (Integer): euribor 3 month rate - daily indicator
- nr.employed (Integer): number of employees - quarterly indicator